UCM/UCS: Fail to create memtype cache if cannot patch Cuda driver API #7865
@yosefe is this only for static linking, or will it impact dynamically linked Cuda applications as well?
It will affect both: only driver API hooks will be used, instead of runtime and driver hooks.
@yosefe in the past you have been avoiding this because of the high cost (system call). Does it mean that performance will go down?
Do you mean memory type detection?
@yosefe got it. thanks
@Akshay-Venkatesh @bureddy can you pls take a look?
Per offline discussion with @bureddy and @Akshay-Venkatesh: will keep reloc hooks for the Cuda runtime API as an option that is disabled by default, for applications that link dynamically and fail to patch the driver API for some reason.
@Akshay-Venkatesh @bureddy can you pls take a look?
Sorry for the delay @yosefe. Taking a look now.
@yosefe I probably missed this in the diff, but:
1. Can you point to how driver hook installation failure prevents memtype_cache from being created?
- Is it the case that memtype_cache creation calls ucm_cudamem_install (which doesn't return UCS_OK if driver hooks failed), and this leads to disabling the memtype cache when the bistro and reloc methods fail on driver functions?
2. Is it the case that the absence of memtype_cache prevents an rcache instance from being created, and so avoids potential data corruption?
    goto out_unlock;
}

status = ucm_cuda_install_hooks(ucm_cuda_runtime_funcs, "runtime",
Do we still need to attempt installation of runtime hooks here? If we're not going to create memtype cache if driver hooks failed, can we not skip this step?
if driver bistro hooks failed, we do skip the other steps:
status = ucm_cuda_install_hooks(ucm_cuda_driver_funcs, "driver",
UCM_MMAP_HOOK_BISTRO, &driver_api_hooks);
if (status != UCS_OK) {
**goto out_unlock;**
}
I mean can we not remove lines 282-285?
these lines are needed in case the user would like to set UCX_MEM_CUDA_HOOK_MODE=reloc manually
ucs_memtype_cache_failed = 1;
if (ucs_global_opts.enable_memtype_cache == UCS_YES) {
    ucs_warn("failed to create memtype cache: %s",
             ucs_status_string(status));
Should it be a hard error if the user explicitly forced memtype cache to be enabled?
i don't think it should be a fatal error since we can still continue the program without memtype cache
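The behavior discussed in this thread (warn only when the cache was explicitly requested, and do not retry after the first failure) can be sketched as below. This is a minimal illustration, not the UCX code: `ternary_t`, `memtype_cache_state_t` and `try_create_cache` are hypothetical names, and the `create_ok` argument stands in for the result of the real creation attempt.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical stand-in for a yes/no/try configuration value. */
typedef enum { CFG_NO, CFG_YES, CFG_TRY } ternary_t;

typedef struct {
    bool failed;   /* latched after the first failed attempt */
    int  attempts; /* number of creation attempts actually made */
    int  warnings; /* number of warnings printed */
} memtype_cache_state_t;

/* Returns true if the cache is available. 'create_ok' simulates whether
 * the underlying creation (including hook installation) succeeded. */
static bool try_create_cache(memtype_cache_state_t *st, ternary_t cfg,
                             bool create_ok)
{
    if ((cfg == CFG_NO) || st->failed) {
        return false; /* disabled, or already failed once: do not retry */
    }

    st->attempts++;
    if (create_ok) {
        return true;
    }

    st->failed = true;
    if (cfg == CFG_YES) {
        /* Warn only if the user explicitly asked for the cache;
         * the program continues without it either way. */
        fprintf(stderr, "warning: failed to create memtype cache\n");
        st->warnings++;
    }
    return false;
}
```

With `CFG_TRY` a failure is silent, and in both modes a later call returns immediately without a new attempt.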
@yosefe If you don't mind, can you include in PR description, the current conditions under which
@Akshay-Venkatesh I've updated the PR description. I hope it answers the previous questions as well.
- Reloc hooks are optional but are not the default. Default is bistro hooks on Cuda driver API.
- If UCM fails to install the configured hooks, do not create the memory type cache.
- If failed to create memtype cache once, don't try again.
- Use getauxval() API if possible instead of reading /proc/self/auxv directly - fixes permissions errors on some systems.
- Enable Cuda bistro hooks also with valgrind, since it doesn't affect heap memory allocations.
- Fix error message in tests.
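The auxiliary-vector item in this commit message can be illustrated with glibc's `getauxval(3)`, on the assumption that this is the API the message refers to. The sketch below is Linux-only and is not the UCX implementation; `auxv_lookup` and `auxv_lookup_procfs` are hypothetical helper names.

```c
#include <elf.h>      /* AT_NULL, AT_PAGESZ */
#include <fcntl.h>
#include <sys/auxv.h> /* getauxval(), glibc >= 2.16 */
#include <unistd.h>

/* Fallback: scan /proc/self/auxv, which is an array of {type, value}
 * pairs of native word size. Opening it can fail with a permissions
 * error on some systems, in which case 0 is returned. */
static unsigned long auxv_lookup_procfs(unsigned long type)
{
    unsigned long entry[2];
    unsigned long value = 0;
    int fd = open("/proc/self/auxv", O_RDONLY);

    if (fd < 0) {
        return 0;
    }

    while (read(fd, entry, sizeof(entry)) == (ssize_t)sizeof(entry)) {
        if (entry[0] == AT_NULL) {
            break; /* end-of-vector marker */
        }
        if (entry[0] == type) {
            value = entry[1];
            break;
        }
    }

    close(fd);
    return value;
}

/* Prefer getauxval(), which needs no file access. It returns 0 when the
 * entry is not found, so this sketch cannot tell "missing" apart from a
 * real value of 0 and simply tries the file in that case. */
static unsigned long auxv_lookup(unsigned long type)
{
    unsigned long value = getauxval(type);

    return (value != 0) ? value : auxv_lookup_procfs(type);
}
```

For example, `auxv_lookup(AT_PAGESZ)` should match `sysconf(_SC_PAGESIZE)` on a typical Linux system.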
Why
Many Cuda applications are statically linked, so it's not safe to assume that relocation-based memory hooks on the Cuda runtime API are enough. To be on the safe side, fail to create the memory type cache and (by default) fall back to pointer-query based memory detection.
Note: As a side effect, this will also disable the RDMA registration cache when Cuda hooks cannot be installed, since not knowing about a Cuda memory release can lead to stale memory keys in the cache and to data corruption.
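The stale-key hazard described in the note can be shown with a toy cache. This is only an illustrative sketch with hypothetical names (`toy_rcache_t`, `toy_rcache_get`, `toy_rcache_invalidate`); it does not reflect the actual UCX rcache data structures.

```c
#include <stddef.h>
#include <stdint.h>

#define TOY_SLOTS 8

typedef struct {
    uintptr_t addr;
    size_t    len;
    uint32_t  key;   /* stands in for an RDMA memory key */
    int       valid;
} toy_region_t;

typedef struct {
    toy_region_t slots[TOY_SLOTS];
    uint32_t     next_key;
} toy_rcache_t;

/* Return the cached key for [addr, addr+len), "registering" the region
 * with a fresh key on a miss. */
static uint32_t toy_rcache_get(toy_rcache_t *cache, uintptr_t addr, size_t len)
{
    toy_region_t *free_slot = NULL;

    for (size_t i = 0; i < TOY_SLOTS; ++i) {
        toy_region_t *r = &cache->slots[i];
        if (r->valid && (r->addr == addr) && (r->len == len)) {
            return r->key; /* hit: reuse the existing key */
        }
        if (!r->valid && (free_slot == NULL)) {
            free_slot = r;
        }
    }

    if (free_slot == NULL) {
        free_slot = &cache->slots[0]; /* trivial eviction for the sketch */
    }

    free_slot->addr  = addr;
    free_slot->len   = len;
    free_slot->key   = ++cache->next_key;
    free_slot->valid = 1;
    return free_slot->key;
}

/* What a memory-release hook would trigger: drop entries at 'addr'. */
static void toy_rcache_invalidate(toy_rcache_t *cache, uintptr_t addr)
{
    for (size_t i = 0; i < TOY_SLOTS; ++i) {
        if (cache->slots[i].valid && (cache->slots[i].addr == addr)) {
            cache->slots[i].valid = 0;
        }
    }
}
```

If memory at an address is freed and reallocated without `toy_rcache_invalidate` being called (the missed-hook case), the next `toy_rcache_get` for that address returns the old key, which no longer describes the new allocation.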
How
The logic is:
- UCX_MEM_CUDA_HOOK_MODE=bistro (default):
  This method sets bistro hooks on the Cuda driver API.
  If the hooks fail to install, the memtype cache is not created. If also UCX_MEMTYPE_CACHE=y (default is try) - a warning is printed.
- UCX_MEM_CUDA_HOOK_MODE=reloc:
  This method sets reloc hooks on both driver and runtime APIs.
  If either driver or runtime API reloc hooks fail (it's not expected) - memtype cache and rcache are disabled.
  This method is safe only if the application is dynamically linked to the Cuda runtime API (or uses the driver API directly), so it's not the default.
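The decision described above can be summarized in a small sketch. The names `hook_mode_t` and `caches_allowed` are hypothetical and only model the outcome; the actual UCX code installs the hooks and reports a status rather than taking booleans.

```c
#include <stdbool.h>

/* Hypothetical model of the two UCX_MEM_CUDA_HOOK_MODE values. */
typedef enum { HOOK_MODE_BISTRO, HOOK_MODE_RELOC } hook_mode_t;

/* Returns true if the memtype cache (and therefore rcache) may be used,
 * given which hook installations succeeded. */
static bool caches_allowed(hook_mode_t mode, bool driver_hooks_ok,
                           bool runtime_hooks_ok)
{
    switch (mode) {
    case HOOK_MODE_BISTRO:
        /* Only driver API hooks are installed in this mode, so the
         * runtime hook result is not considered. */
        return driver_hooks_ok;
    case HOOK_MODE_RELOC:
        /* Reloc hooks are set on both APIs; either failure disables
         * the caches. */
        return driver_hooks_ok && runtime_hooks_ok;
    }
    return false; /* unknown mode: be conservative */
}
```

In both modes a failure means falling back to pointer-query based memory detection, as described in the Why section.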
Debugging
Set "UCX_MEM_LOG_LEVEL=diag" to get more information if memory hooks fail to install.
Related to #7791 (comment)
cc @Akshay-Venkatesh @pentschev